CAGEF_services_slide.png

Introduction to Python for Data Science

Lecture 04: Exploratory data analysis and plotting

Student Name: Live Lecture HTML

Student ID: 220203


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download at QUERCUS and will also be distributed via email. The teaching materials consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will allow you to spend your time coding rather than taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this fourth lecture in a series of seven. Today we will pick up where we left off last week with our merged data. We'll learn how to explore the data, summarize it, and plot it!

At the end of this lecture we will aim to have covered the following topics:

  1. Exploratory data analysis with long-format data
  2. Data plotting with the matplotlib.pyplot package
  3. Advanced visualizations with the seaborn package
  4. Saving visualizations to file

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's lesson uses a subset of our merged Human Microbiome Project data from last week, plus a new pit latrine dataset.

0.4.1 Dataset 1: subset_data_taxa_merged.csv

This is a copy of our formatted and merged data from Lecture 03. It is a smaller subset of the final dataset, for doing some simple exploratory data analysis towards the end of class. The original data was obtained from phase 1 of the Human Microbiome Project. The HMP was a joint effort led by the US National Institutes of Health to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (a.k.a. environmental DNA) and 16S ribosomal RNA gene amplicon sequencing. The dataset that we will use is the result of 16S rRNA gene sequencing, a technique in which this gene is used as a marker for taxonomic identification and classification.


0.4.2 Dataset 2: Latrines

Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene

16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20 cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.

We have three CSV files.

0.4.2.1 File 1: data/metadata_pitlatrine.csv

A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.

0.4.2.2 File 2: data/taxa_pitlatrine.csv

An OTU abundance table containing the numbers of operational taxonomic units sampled from various sites.

0.4.2.3 Data source

B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334


0.5.0 Packages used in this lesson

IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so that we can see multiple code outputs per code cell.

random is a package with methods to add pseudorandomness to programs

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

os, pandas, and matplotlib will also be used throughout this lesson.
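Pulling these together, a setup cell for this lecture might look like the sketch below. The InteractiveShell tweak only applies inside Jupyter/IPython, so it is wrapped defensively here:

```python
# Make every expression in a cell display its output, not just the last one.
try:
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"
except ImportError:
    pass  # running outside Jupyter/IPython; nothing to configure

import os
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```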


1.0.0 An introduction to exploratory data analysis

Recall that last week we spent our lecture converting data from the Human Microbiome Project from wide format to long format. Now that the wrangling is complete, we can perform some exploratory data analysis (EDA). The process of EDA investigates your data to identify abnormalities, summarize its main characteristics, and identify potential patterns or trends for further validation.

With our EDA today we will try to answer questions like:

  1. Which genus is most often encountered in our sampling?
  2. Which genus has the highest number of counts across our samples?
  3. Which body site hosts the most diverse OTUs?

1.0.1 Import our data from last week

Let's open a version of our dataset from last week. We'll find it in the file data/subset_data_taxa_merged.csv. Recall that this is a representative subset of the entire dataset. It is always good practice to explore your data to find out more about aspects such as probability distributions, outliers, and measures of central tendency and dispersion.

Now we are going to do some exploratory data analysis (EDA) on subset_data_taxa_merged.csv, a subset of the merged dataset (0.001%).
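In class the file is read straight from disk with pd.read_csv("data/subset_data_taxa_merged.csv"). The sketch below instead fakes a few rows of that file in-line so it runs anywhere; the sample values and the SAMPLE_ID column name are invented for illustration:

```python
from io import StringIO
import pandas as pd

# In class: merged_subset = pd.read_csv("data/subset_data_taxa_merged.csv")
# Here we mimic a tiny slice of that file (values are invented):
csv_text = """OTU,SAMPLE_ID,GENUS,HMP_BODY_SUBSITE,count
OTU_97_10,SRS0001,Prevotella,Stool,12
OTU_97_10,SRS0002,Prevotella,Throat,0
OTU_97_31,SRS0001,Veillonella,Stool,3
"""
merged_subset = pd.read_csv(StringIO(csv_text))
merged_subset.info()  # row/column counts, dtypes, and memory usage
```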


1.1.0 Which genus is most often encountered across our sampling?

Let's begin with a deceptively simple question about our data. As we'll see, however, it requires more than a single method call to answer. We'll walk through the thought process so you can avoid potential pitfalls later in your own analyses.

1.1.1 Get a summary of our information

So we've imported a dataset of 999,000 rows across 16 columns. You'll note a new column named "OTU" that we didn't have in our prior dataset. Previously we had left this as an index when we were completing our merges, but now it exists as its own column. Recall that at the time of importing we could have used the index_col parameter to bring OTU in as an index, as we had it before.

Time to pull up a summary of the numeric information. Remember that we can use the describe() method to accomplish this.
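As a sketch (with a tiny invented stand-in for merged_subset), describe() reports the count, mean, standard deviation, min, quartiles, and max of each numeric column:

```python
import pandas as pd

# Invented stand-in for merged_subset
merged_subset = pd.DataFrame({
    "OTU": ["OTU_97_10", "OTU_97_31", "OTU_97_44"],
    "count": [12, 0, 3],
})

merged_subset.describe()  # summarizes the numeric "count" column only
```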


1.1.2 Use the describe() method to summarize non-numeric columns

Recall that we can also summarize our non-numeric data, to a certain extent, as long as we provide it properly to the describe() method. This method can identify the "top" occurring entry in a column as well as its frequency. We already know that the OTU column contains each occurring OTU, so it should be simple enough to create a summary of that information. Let's give it a try.
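A minimal sketch, again with invented values: for object-dtype data, describe() reports the count, number of unique values, the top value, and its frequency.

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "OTU": ["OTU_97_10", "OTU_97_10", "OTU_97_31"],  # invented values
    "count": [12, 0, 3],
})

# Summarize a single non-numeric column...
merged_subset["OTU"].describe()         # count, unique, top, freq
# ...or every object-dtype column at once:
merged_subset.describe(include="object")
```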


1.1.3 Filter your data with conditional booleans

Last lecture we saw a few examples where we subset our data using a simple conditional statement like isna(). While we referred to this as "slicing" our data, you can also think of it as a way to filter data. We can, of course, filter our data using other conditional statements, and before summarizing your data, you should consider the nature of your values! In our original subset data it appears that each OTU appears 1000 times.

Recall that our original dataset started off with 1000 observations and ~2000 OTU columns. When we melted the data, this generated quite a large ~2M-row dataset before merging with our taxa metadata, which represented ~1000 OTUs. By virtue of this process, each OTU could appear 1000 times in our data BUT it may not have actually occurred within each sample!

To solve this conundrum we turn to filtering our data by the count column. This is a measurement of the number of occurrences of a specific OTU in any sample! We'll do this in two steps:

  1. Filter the dataset by a conditional.
  2. Pass that dataset on for summarization.
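The two steps above can be sketched like this (the toy frame below is an invented stand-in, where one OTU row has a count of 0 and should not be treated as observed):

```python
import pandas as pd

# Invented stand-in: OTU_97_31 "appears" twice but one count is 0
merged_subset = pd.DataFrame({
    "OTU": ["OTU_97_10", "OTU_97_10", "OTU_97_31", "OTU_97_31"],
    "count": [12, 5, 3, 0],
})

# Step 1: keep only rows where the OTU was actually observed
present = merged_subset[merged_subset["count"] > 0]
# Step 2: pass the filtered data on for summarization
present["OTU"].describe()
```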

So according to our quick describe, we see that OTU_97_10362 appears most often in our data set but is that the answer to our question? Looking at the total count, there are 249750 non-NA values and 999 unique values - or an average of 250 entries per OTU! Are we sure we summarized this correctly?


We've now come to the simple answer showing that OTU_97_10 appears most frequently with a count of > 0 within our data. What is the specific information about this OTU? We can filter for this information too!

Since the output of our summary is also an indexed object, we can pull out the top value as an attribute.


1.1.4 Know the difference between filtering and the filter() method

You may be wondering to yourself, surely there must be a filter() method implemented with the DataFrame. It seems like such an essential part of working with DataFrames. You're right that such a method exists but you would be wrong to think that it is used for conditional filtering.

The filter() method is used merely for subsetting your DataFrame by column name or by regular expression pattern (See Lecture 06). It does not use conditional logic to subset rows from your data. While this can be helpful in certain contexts, it does NOT implement the idea of filtering rows of our data based on conditional criteria.

Here's a quick example of how to use it.
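A short sketch with an invented stand-in frame; note that every call below selects columns by name or pattern, never rows by value:

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "OTU": ["OTU_97_10"], "GENUS": ["Prevotella"],
    "HMP_BODY_SUBSITE": ["Stool"], "count": [12],
})

# filter() selects COLUMNS (or index labels) - it does not test row values
merged_subset.filter(items=["OTU", "count"])  # exact column names
merged_subset.filter(like="OTU")              # names containing "OTU"
merged_subset.filter(regex="^H")              # names matching a regex
```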


1.2.0 Which genus has the highest number of total counts?

While we have discovered which specific OTU appears to be most frequently encountered amongst all the samples, it may occur commonly but in very low amounts. Furthermore, we may be missing some of the big picture, as many OTUs may really come from the same genus. This kind of information may be more helpful for understanding our microbiome data, so how do we explore this?

1.2.1 Grouping your DataFrame using the groupby() method

Thinking about our problem, we have a GENUS column and we can take advantage of the values in this column by exploring data based on the unique values in GENUS. We could try to filter our data using a different conditional GENUS == Veillonella but we would have to somehow cycle through each potential value and summarize the subset. (Honestly this would have been my approach not so long ago!).

Luckily for you, the groupby() method can sort all of that data for you based on the criteria you provide. The important parameters to use today are:
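As a sketch with a tiny invented frame, the two parameters worth noting are by= (the grouping column) and dropna= (whether rows with a missing group key are excluded, True by default):

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Prevotella", "Veillonella", None],
    "count": [12, 5, 3, 7],
})

# by= names the grouping column; dropna=True (the default) drops NaN keys
grouped = merged_subset.groupby(by="GENUS", dropna=True)
grouped.ngroups  # number of groups formed (the NaN row is excluded)
```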


1.2.2 Use the head() method to view rows from each group

As you can see above, we created a grouped DataFrame, but if we attempted to look at it, it would look pretty much like the original merged_subset. The major difference is that the data has now essentially been sorted by the GENUS column. In order to view part of it, we can use the head(n) method, which will return n rows from each group.


1.2.3 Subset and apply functions to your grouped DataFrame

As expected, we see 85 rows in our data, meaning there are 85 groups in our grouped DataFrame! Now that we have our groups, we can begin to apply functions to summarize data from each group. Back to our question: we can now ask which genus produces the largest number of total counts.

We have already encountered the sum() and max() functions to some degree so it is simply a matter of applying them to the correct subset of data. Recall that our grouped data will be indexed by genus so we can subset our count column and then apply functions to it. Let's see what that looks like step-by-step.
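Step by step, with an invented stand-in frame: subset the count column from the grouped data, then apply a summary function to it. Notice how max() alone discards the group labels:

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Prevotella", "Veillonella", "Veillonella"],
    "count": [12, 5, 3, 7],
})

totals = merged_subset.groupby("GENUS")["count"].sum()  # one total per genus
totals
totals.max()  # the largest total... but which genus was it?
```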


1.2.4 Use sort_values() to organize your data

Not exactly the result we were looking for. The max() function did reveal the largest value but we lost the index information at the same time, leaving us without a complete answer. Instead, what we can do is sort the resulting summed group data in descending order with the sort_values() method. To sort in descending or reverse order, we can set the ascending = False parameter.
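Sorting keeps the index labels attached to the values, so the top entry answers the question directly. A sketch with invented values:

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Prevotella", "Veillonella", "Gemella"],
    "count": [12, 5, 3, 7],
})

totals = merged_subset.groupby("GENUS")["count"].sum()
totals.sort_values(ascending=False)  # largest total first, index intact
```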


We can now answer our original question for this section and say that the Prevotella genus has the highest total number of counts in our dataset with a total of 10351.

Section 1.2.4 Comprehension Question: While we may have identified Prevotella as having the highest total number of counts, does this have much meaning? Perhaps it is widely identified at very low counts, or rarely found but highly abundant. How would you approach this question to find an answer?

1.3.0 Which body subsite has the most diverse set of genera?

Now that we've got a few very helpful tools under our belt, we can ask a similar question about the diversity of genera within each body subsite. Let's begin by reviewing the basic information about our dataset.


1.3.1 Plan your analysis strategy to avoid mistaken assumptions

Time to think about your dataset in relation to your question. We already know that each genus may appear within any body subsite, but its count may be 0. Again, it will be important to filter our data before summarizing it. Then you must understand what you are looking for and how it appears in your data.

  1. As with our first question, we should filter our data before proceeding to summarize it.
  2. Group the data by the particular column you are interested in. In this case HMP_BODY_SUBSITE.
  3. Count the number of unique occurences of each genus within each group.

Steps 1 and 2 should be straightforward. We'll use the first() method to look at the first entry of each group.
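Steps 1 and 2 might look like the following sketch (toy data invented for illustration, with one zero-count row that filtering removes):

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Veillonella", "Gemella", "Gemella"],
    "HMP_BODY_SUBSITE": ["Stool", "Stool", "Throat", "Throat"],
    "count": [12, 3, 0, 7],
})

# Step 1: keep only genera that were actually observed
present = merged_subset[merged_subset["count"] > 0]
# Step 2: group by body subsite
by_site = present.groupby("HMP_BODY_SUBSITE")
by_site.first()  # the first row of each group
```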


1.3.2 Broadcast the unique() method across groups

Now that we know how to group our data by body subsite, what we really want to do is find the unique genera within each group. We already know we can use the unique() method, but as we'll see, the results are not quite what we expect. The nature of the grouped data generates an np.array object as a result of the following code.


1.3.3 Use the dropna() method to remove data with NA values

As we can see above we have some NA values in our output. In order to avoid accidentally counting these values in our final analysis, we should remove them. There is an easy way to clean our entire dataset by removing NA values. Using the dropna() method we can remove either rows or columns that contain a single or all NA values. Some of the relevant parameters for us in this method are:


Now we just need to use a function or method to determine the length of the array. Our choices, of course, are the len() function or the .size property. Let's try them both and see what we get.
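A sketch of dropping NA values, collecting the unique genera per group, and then trying len() and .size on the result (toy data invented for illustration):

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Veillonella", None, "Gemella"],
    "HMP_BODY_SUBSITE": ["Stool", "Stool", "Throat", "Throat"],
    "count": [12, 3, 5, 7],
})

# Drop rows with any NA before grouping, then collect unique genera per site
unique_genera = merged_subset.dropna().groupby("HMP_BODY_SUBSITE")["GENUS"].unique()
unique_genera       # one numpy array of genera per subsite
len(unique_genera)  # length of the Series itself (the number of subsites!)
unique_genera.size  # the same number again
```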


1.3.4 Use the apply() method to broadcast a function to individual elements

Why did both of our sets of code result in a single value - 18? This is actually the number of rows in our Series (i.e. the number of unique HMP_BODY_SUBSITE values). That's not at all what we want, but with our code as it is, that's what we asked for. In both cases, after our call to unique() we have generated a Series object. Calling len() and .size gives us information about the Series itself, but what we want is to retrieve information about the values stored in that object.

Recall that DataFrame objects can broadcast mathematical operations: we can apply a mathematical operation to each individual element of a DataFrame. Likewise, we can broadcast more complex functions onto individual elements by using the apply() method.

The apply() method takes the form apply(func, args=(), **kwargs). We'll talk a little more about some of these parameters in later lectures, but for now:

In our case, we want to apply the .size attribute to each array in our Series. You'll see one more unfamiliar piece of code: lambda. This is how we can tell Python we want to make a quick function on the spot. We'll cover this more in a later lecture as well so for now we'll just roll with it.
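A sketch of broadcasting .size onto each array with a throwaway lambda (toy data invented for illustration):

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Veillonella", "Gemella"],
    "HMP_BODY_SUBSITE": ["Stool", "Stool", "Throat"],
})

unique_genera = merged_subset.groupby("HMP_BODY_SUBSITE")["GENUS"].unique()
# Broadcast .size onto each array element with an on-the-spot lambda function
unique_genera.apply(lambda arr: arr.size)
```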


1.3.5 Use nunique() on a grouped dataframe to return the number of unique elements

Looking at our code above, we went through 5 steps to get a final answer:

What if, instead, we used a helpful method - nunique() - to count the number of unique elements in our grouped DataFrame? This method simplifies the process by combining the unique() and len() steps we used previously to achieve the same goal. Furthermore, it ignores NA values by default. Using this method simplifies our process into just 3 lines of code, as we'll see.
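The whole analysis collapses into one short chain, sketched here with invented toy data:

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Veillonella", "Prevotella", "Gemella"],
    "HMP_BODY_SUBSITE": ["Stool", "Stool", "Stool", "Throat"],
    "count": [12, 3, 5, 7],
})

# Filter, group, and count unique genera per subsite in one chain
(merged_subset[merged_subset["count"] > 0]
    .groupby("HMP_BODY_SUBSITE")["GENUS"]
    .nunique())
```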


1.3.6 Use the agg() function to generate a summary from multiple functions

Rather than using just a single function like nunique() you can actually apply multiple functions to a grouped dataframe to generate a summary of the information. In this case we are still only looking at the GENUS column so we are a little limited by the kind of summary we can achieve from string data. We can, however, still count() the number of entries in each group.

To combine both of these methods to produce a summary, we'll use the agg() method, which will accept a list of function names like "count" and "nunique" but also "sum", "min", "max" and other methods found in the GroupBy object.
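A sketch of passing a list of function names to agg(), with invented toy data; each name becomes a column in the summary:

```python
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Prevotella", "Veillonella", "Prevotella"],
    "HMP_BODY_SUBSITE": ["Stool", "Stool", "Throat"],
})

# agg() accepts a list of function names and applies each one per group
merged_subset.groupby("HMP_BODY_SUBSITE")["GENUS"].agg(["count", "nunique"])
```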

How do I know what to use? The GroupBy object has a number of its own methods, and the only way to get better at knowing what to use is by studying the commands at your disposal. For a brief description of the various methods available, you can check out the online Pandas reference manual.

1.3.7 Find the right tool to save yourself time

So after a couple of possible directions, we find that the throat has the most diverse set of genera across our samples, with 34 (out of 84 unique genera encountered). We also discovered that while methods like apply() give you a lot of flexibility in what you'd like to calculate with your data, sometimes there is already a specialized tool like nunique() that accomplishes what you want with less coding on your part!

After more practice and experience, some of these functions will become second nature in your coding choices.

Section 1.0.0 Comprehension Question Now that you know how to group data by specific variables, how would you calculate the number of groups created when combining the SEX and HMP_BODY_SUBSITE variables? How many total groupings are there and which group has the largest number of entries with a count above 0? Use the code cell below to help generate an answer.

2.0.0 Plotting with the pyplot module

2.0.1 Use exploratory plots to assess your data

Now that we've had a chance to look at our data close up, let's talk about how we can use exploratory plots to give us a quick visual assessment of our data. We can use these visualizations to help make decisions about how to further analyse our data. Is there a difference between different groups of data? Does it look like there might be any bias between our datasets? What does the overall distribution of our sampling look like?

2.0.2 Types of plots

Often when trying to convey a message about our data through a visualization, we want to choose the right kind of visualization. These visualizations can also be referred to as figures or plots. Within the matplotlib package is the pyplot module. It is a collection of functions that give the matplotlib package capabilities very similar to those of the programming language MATLAB. The pyplot module has functions that can create some of the following basic plots:

Plot type | Command | What to use it for
Bar plot | bar() | Population data summaries; helpful for contrasting between groups
Scatter plot | scatter() | Multiple independent measurements across different variables
Line plot | plot() | Multiple measurements that represent the same sample(s)
Histogram | hist() | Generate a distribution by binning your data
Stem or lollipop plot | stem() | A twist on the bar plot that may be more compact and visually pleasing
Box plot | boxplot() | Create a visual summary of your datapoints based on their distribution
Violin plot | violinplot() | Create a visual kernel density (distribution) estimate of datapoints

2.0.3 Plot components

Within each plot are a number of basic components: titles, axis properties, legends, etc. Here is a helpful table outlining some of the basic plot components.

Component | Description | Command | Parameters
Title | The title of your plot | title() | label=str, loc={'left', 'center', 'right'}, text properties
X- or Y-axis title | The axis titles of your plot | xlabel(), ylabel() | xlabel=str / ylabel=str, loc={'left', 'center', 'right'}, text properties
X- or Y-axis ticks | Alter your axis tick positions/locations and labels | xticks(), yticks() | ticks=[a, ..., n], labels=[label1, ..., labeln]
Axis limits | A list defining the x- and y-axis limits | axis() | [xMin, xMax, yMin, yMax]
Axis scale | Set the kind of axis scale for your data | xscale(), yscale() | "linear", "log", "symlog", "logit"
Text properties | Labels can take text parameters too | | color, fontsize, fontstyle, rotation

2.1.0 Build a basic barplot with the bar() method

We'll use our body subsite results as an example to try and plot some of our data as a bar plot. The bar() method generally requires two sets of data to be supplied along with some optional data:

Let's start by building a basic barplot and see what needs to be altered as we move forward. Note that we use the plt.show() method to display our plot after putting a lot of pieces together.
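A minimal sketch, using invented per-subsite genus counts in place of our grouped results:

```python
import matplotlib.pyplot as plt

# Invented per-subsite genus counts, standing in for our grouped results
subsites = ["Stool", "Throat", "Saliva", "Anterior Nares"]
n_genera = [30, 34, 28, 15]

plt.bar(subsites, n_genera)  # x categories, y bar heights
plt.show()                   # render everything assembled so far
```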


2.1.1 Use the figure() method to set your figure size

As we can see from our first attempt, the plot is rather small. We definitely have some problems with the basics of this plot and we'll address the first one that might help. This figure could be a little larger so we can see the x-axis labels better. Use the figure() method to set your figure size using the figsize parameter.

2.1.2 Rotate your x-axis text with the xticks() method

Now that our plot is larger, the x-axis labels are still overcrowded. We can, however, rotate the text and see if that helps. Let's rotate the labels to a 90-degree angle. We'll alter these axis properties through the xticks() method.


2.1.3 Add labels to your plot

Okay, the plot is larger and we've fixed our x-axis label issues. No more overcrowding! Let's add a main title and axis titles to our dataset. We can use the title(), xlabel(), and ylabel() methods in this case. While we set the labels, we can also set their properties such as fontsize, fontstyle, and color.


2.1.4 Alter properties of your barplot

Similar to altering text properties, most of the plots have the ability to alter various properties like fill and line color or other plot-specific attributes. Let's update the fill and line colours for our barplot using the color and edgecolor parameters.
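Putting sections 2.1.1 through 2.1.4 together, a styled version of the barplot might look like this sketch (the data, colours, and label text are invented for illustration):

```python
import matplotlib.pyplot as plt

subsites = ["Stool", "Throat", "Saliva", "Anterior Nares"]  # invented data
n_genera = [30, 34, 28, 15]

fig = plt.figure(figsize=(10, 5))              # a wider figure (2.1.1)
plt.bar(subsites, n_genera,
        color="lightblue", edgecolor="navy")   # fill and outline (2.1.4)
plt.xticks(rotation=90)                        # rotate crowded labels (2.1.2)
plt.title("Unique genera per body subsite", fontsize=14)
plt.xlabel("Body subsite", fontstyle="italic") # labels and properties (2.1.3)
plt.ylabel("Number of genera")
plt.show()
```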

Section 2.0.0 Comprehension Question What if we would like to change our x-axis to limit the number of subsites we are displaying? How would you plot just the first 10 values using the plt.axis() method? Modify the code below to try and alter the x-axis limits. For more information on how to use this method check out the documentation.
Learn more about Pyplot: We've barely scratched the surface of working with the Pyplot module, and depending on your needs it may be enough to get by with. You can find more at the matplotlib tutorial for pyplot.

3.0.0 Basic visualizations with the seaborn package

Building upon our visualizations in the last section, there are some common themes you might recognize about them. We have a plot area, x- and y-axis data, axis limits, and plot colors. Using matplotlib to help generate your visualizations, you can control many small details but it can also be tedious at times to micromanage so many aspects of your plot.

The seaborn package is actually built upon the pyplot module and tries to bring a high-level approach to statistical plots. As we'll see later on, this means updating certain details of our plots will require an understanding of the base matplotlib and pyplot functions.

3.0.1 The seaborn package subdivides plot types into 3 categories

The seaborn package takes a dual-pronged approach to generating plots. There are functions considered to work at the Figure level and then there are functions that affect what is known as the Axes level. To simplify the concept:

At the Axes level, there are 3 categories of plot types based on their similarity: relational, distribution, and categorical. For each of these categories there is a figure-level function that can be used to create multi-panel (faceted) versions of these plots by splitting the data further by categorical variables.

seaborn_function_overview_8_0.png
From the seaborn overview: for most simple plots, one of the above figure or axes-level plots can be utilized.

The above functions are used to initialize figure and axes objects by identifying a number of properties. Within these, some of the options can vary greatly based on plot type. Of the two levels of functions, their influence on figure attributes can vary:

3.1.0 Introduction to the Grammar of Graphics

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:

ggplot2.png

The grammar of graphics is a language to communicate about what we are plotting programmatically


3.1.1 The grammar of graphics with seaborn

The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of R tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework in the ggplot2 package for R. By following a layered approach of defined components, it can be easy to build a visualization.

In a similar manner, the seaborn package has some methods that facilitate a layering approach to building your visualizations. However, many of the details are built upon the foundation of layering Axes objects or alterations upon Figure objects.

Each Axes-level function usually takes in:

  1. Data: your visualization always starts here. What are the dimensions you want to visualize. What aspect of your data are you trying to convey?

  2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.

  3. Geometric objects: how will you display your data within your visualization. Which *plot() function will you use?

The figure-level methods can be used to alter or update:

  1. Scale: do you need to alter your x- or y-axis limits? What about scaling/transforming any values to fit your data within a range? Sometimes, depending on the Geometric object, you are better off transforming your data ahead of time.

  2. Facets: will generating subplots of the data add a dimension to your visualization that would otherwise be lost?

  3. Coordinate system: will your visualization follow a classic Cartesian, semi-log, polar, etc. coordinate system?

Let's jump into our first dataset and start building some plots with it shall we?

3.1.2 Import metadata_pitlatrine

Before we dig into the seaborn package, we will import a new dataset that has a much more diverse array of data that we can use for showcasing the plotting power of seaborn.


Taking a quick look at our data, we can summarize it briefly and see that there are a few categories we can explore, like Country and Depth, across measured variables like pH, Temp, TS (total solids), and CODt (total chemical oxygen demand).

3.2.0 Build a scatterplot using the relplot() method

We'll start by building a basic scatterplot. We'll focus on comparing the total chemical oxygen demand versus the total solids count. Rather than working at the axes-level, we'll work with the encompassing relplot() method which will give us flexibility in our visual exploration down the road.

For the basic plot relplot(), we'll start with the following parameters:


Looking at the output we can see that without having specified the x and y axis values, it simply plotted all of the variables along a default x-axis of index number (80 rows total). Let's try again by setting our axis variables.
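With the axis variables set, the call might look like this sketch; the tiny frame below is an invented stand-in for pitlatrine_metadata, reusing column names from the lesson:

```python
import pandas as pd
import seaborn as sns

# Invented stand-in for pitlatrine_metadata (column names from the lesson)
meta = pd.DataFrame({
    "Country": ["Tanzania", "Tanzania", "Vietnam", "Vietnam"],
    "TS":   [15.2, 30.1, 12.4, 18.9],   # total solids
    "CODt": [4800, 9500, 3900, 6100],   # total chemical oxygen demand
})

g = sns.relplot(data=meta, x="TS", y="CODt")  # a scatterplot by default
```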


3.2.1 Specify colouring by category with the hue parameter

Now we begin to see the power of having a tidy DataFrame. Since each of our observations is in its own row, we can classify each observation by factors like Country! Using the seaborn package, we can specify the colour of our points using the hue parameter. Since we have set the data = pitlatrine_metadata parameter, we can tell seaborn to look at a specific column when determining the hue parameter.

At the same time, we'll set the alpha parameter which is essentially the opacity of each datapoint. Setting a lower value increases transparency which allows us to see overlapping datapoints better. This parameter becomes especially helpful when working with extremely dense datapoints.
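Both parameters slot straight into the relplot() call, sketched here with the same invented stand-in data:

```python
import pandas as pd
import seaborn as sns

meta = pd.DataFrame({                    # invented stand-in data
    "Country": ["Tanzania", "Tanzania", "Vietnam", "Vietnam"],
    "TS":   [15.2, 30.1, 12.4, 18.9],
    "CODt": [4800, 9500, 3900, 6100],
})

# hue= names a column to colour points by; alpha= sets point opacity
g = sns.relplot(data=meta, x="TS", y="CODt", hue="Country", alpha=0.6)
```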


3.2.2 Alter your axis scale with the set method

The set() method is a gateway to altering a number of aspects of your plot. Once we have our plot saved as an object named snsplot, we can alter or set some of its properties this way. In particular, we will be using the yscale parameter to adjust the y-axis to a log scale.

Note that our plot object is actually kept as a type of matplotlib.axes.Axes and that's where we are calling the set() method from. We can actually set quite a few figure attributes through this method.
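A short sketch with invented stand-in data (here relplot() returns a FacetGrid, which also exposes a set() method that forwards keywords to the underlying axes):

```python
import pandas as pd
import seaborn as sns

meta = pd.DataFrame({                    # invented stand-in data
    "TS":   [15.2, 30.1, 12.4, 18.9],
    "CODt": [4800, 9500, 3900, 6100],
})

snsplot = sns.relplot(data=meta, x="TS", y="CODt")
snsplot.set(yscale="log")                # log-scale the y-axis
```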

More on figure attributes: For a full list of plot attributes you can update through this method you can visit the Matplotlib help page.

3.2.3 Facet your data into different plots with the relplot() method

If we want to split our data into multiple new plots based on certain variables, this is known as faceting your data. Usually this results in a grid-like pattern where data is grouped by the categories of one variable as columns and another variable as rows, although the data could simply be split on a single variable instead. Either way, this generates a figure-level object known as a FacetGrid().

The relplot() method already has the capability to handle this splitting of axes within the figure it generates. The relplot() method can facet data across two variables using the row and col parameters. We simply need to name the variable(s) that will be used to categorize the data.

To summarize, the parameters to use for this operation are:

Below, we'll remove the colouring of points based on Country and instead, split the data into two Axes based on this information.
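Swapping hue= for col= splits the figure into one panel per country, sketched here with the same invented stand-in data:

```python
import pandas as pd
import seaborn as sns

meta = pd.DataFrame({                    # invented stand-in data
    "Country": ["Tanzania", "Tanzania", "Vietnam", "Vietnam"],
    "TS":   [15.2, 30.1, 12.4, 18.9],
    "CODt": [4800, 9500, 3900, 6100],
})

# col="Country" makes one panel per country instead of one colour per country
g = sns.relplot(data=meta, x="TS", y="CODt", col="Country")
```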


3.2.4 Use a continuous variable to colour your data

Note that we typically need to just apply one attribute to each dimension of data we are investigating. By splitting the data by Country we no longer need to colour it based on this category. We can, however, add additional information to our visualization by using another dimension in our data. Instead of colouring the points based on a categorical variable (Tanzania vs Vietnam), we can use a continuous variable like Temp from our dataset to see if there could be a trend in relation to our data.

It is easy enough to set this dimension using the hue parameter in our initial relplot() call. We'll also set the palette parameter to a different colour.

By colouring our datapoints by temperature, we can now see more clearly on the same plot that latrine samples from Tanzania are generally measured at a higher temperature. Rather than generating additional plots comparing different pairs of variables, we've simply added an additional dimension of information to our visualization.


3.2.5 Update your axis using the set_axis_labels() method

The names of our axis titles are drawn from the variable names we used in the original DataFrame, but we may be limited by how those variables were originally named. In other cases you may wish to add units, or simply make your axis titles more descriptive. To accomplish this we can alter our labels directly using the set_axis_labels() method. The parameters to set are x_var and y_var, in that order. Set them by keyword if you only want to change a single axis title.
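For example (toy data; the label strings here are placeholders, not the ones from lecture):

```python
import pandas as pd
import seaborn as sns

meta = pd.DataFrame({
    "pH": [7.1, 8.2, 6.5, 7.8, 9.0, 6.9],
    "Temp": [25.0, 27.5, 24.1, 30.2, 31.0, 23.5],
    "Country": ["Tanzania"] * 3 + ["Vietnam"] * 3,
})

g = sns.relplot(data=meta, x="pH", y="Temp", col="Country")
# x_var and y_var, in that order
g.set_axis_labels("pH of latrine sample", "Temperature (C)")
```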


3.2.6 Set the shape of your points with the style parameter

We'll switch gears a little at this point and ask what our data looks like when we compare the Depth of our latrine measurements instead. There are many depth measurements (11 total), which will be very hard to read if they are presented in a single row, so we'll also use the col_wrap parameter to set the number of facets per row. Note that col_wrap is not compatible with faceting by both a column and a row variable.

We'd still like to distinguish between our two different Country values so we'll assign the shape of these values using the style parameter instead.
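A sketch of the combination (the toy data has only 3 depth levels rather than the 11 in the course data, so col_wrap=2 here just demonstrates the wrapping):

```python
import pandas as pd
import seaborn as sns

# Toy data with a few Depth levels (the course data has 11)
meta = pd.DataFrame({
    "pH": [7.1, 8.2, 6.5, 7.8, 9.0, 6.9],
    "Temp": [25.0, 27.5, 24.1, 30.2, 31.0, 23.5],
    "Depth": [0.5, 1.0, 1.5, 0.5, 1.0, 1.5],
    "Country": ["Tanzania"] * 3 + ["Vietnam"] * 3,
})

# One facet per Depth, wrapped at 2 facets per row;
# Country is now encoded as marker shape via style
g = sns.relplot(data=meta, x="pH", y="Temp", col="Depth", col_wrap=2,
                style="Country")
```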


3.3.0 Plotting theoretical distributions

Now that we have some of the basics, it's time to take a closer look at using other types of plots. We'll begin by reviewing the latrines dataset after loading it into memory as the variable latrines.

3.3.1 Look at your distribution as a kernel density estimate with kdeplot() or displot()

There are a lot of datapoints in our new dataset. This time our measurements are based on OTU counts for specific taxa within different latrines across Tanzania and Vietnam. We already looked at a lot of the metadata about each latrine sample in the pitlatrine_data from our previous plots. A quick question we can ask about the data is "what is the overall distribution of OTU counts in our data?"

A quick way to answer this question is by generating a kernel density estimate (KDE) using the displot() method from seaborn. We'll need to provide an x value (OTUs in this case) and we'll colour our plots based on Country.


3.3.2 Trim your data by filtering it

Well, it looks like the majority of our distribution is close to a value of 0, and our OTUs can range as high as 17,572! Let's simplify the dataset we'll look at by filtering it. We can do so right in our call to kdeplot(). In this case we'll set the OTU values to be between 2 and 1000 by filtering with two conditionals and combining them with the & operator. Be sure to use parentheses correctly here so that each comparison is evaluated on its own before the results are combined.


3.3.3 Add a rugplot() to the margin

Within seaborn there are a few ways that marginal plots can be added to your visualizations. Marginal plots usually add distribution summaries like a histogram, KDE or, in our case, a rugplot. More specifically, we'll be using the rugplot() method, but there is also the ability to create certain plot combinations using the jointplot() method.

A rugplot is simply a series of vertical or horizontal tick-marks representing our actual data points along the x- and/or y-axis. For our rugplot, we'll add it outside of our plot area by manipulating the height and clip_on parameters to help us out. We'll also set the alpha parameter to help us see the density of our tick marks a little better. Regardless of where it is drawn, we are adding this plot to the underlying Axes object of the current snsPlot figure.


3.4.0 Boxplots provide visual summary statistics of your data

Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.

We are going to use boxplots to see the distribution of OTUs per Taxa across all samples.

3.4.1 Generate a basic boxplot with the catplot() method

To build the basic boxplot we begin with the main variables. We want to summarize the OTU counts (y-axis) for each taxa (x-axis) to get a sense of the OTU count distribution for each taxon. We can use the catplot() method to build our visualization. You'll note that the plot is automatically coloured to differentiate between the x-axis groups.

The catplot() method is the gateway to more categorical plots and behaves similarly to the other two figure-level plots we've encountered.
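A minimal sketch of the basic boxplot (the taxa names in this toy DataFrame are invented examples, not necessarily those in the course data):

```python
import pandas as pd
import seaborn as sns

# Toy long-format data; the taxa names here are invented examples
latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli", "Bacteroidia"] * 4,
    "OTUs": [120, 30, 45, 200, 25, 60, 90, 40, 55, 150, 35, 70],
})

# kind="box" asks the figure-level catplot() for boxplots
g = sns.catplot(data=latrines, x="Taxa", y="OTUs", kind="box")
```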


3.4.2 Rotate your x-axis text directly through pyplot with xticks()

We've encountered this problem before and the solution is actually the same. Recall that seaborn is built upon the back of the pyplot module, so we can actually modify the plot directly through pyplot!

While we're here, let's update the y-axis scale as we have before using the set() method on our figure object.
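Both adjustments together might look like this (toy data without zero counts, so the log scale behaves here; the course data will not be so forgiving, as the next section discusses):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli", "Bacteroidia"] * 4,
    "OTUs": [120, 30, 45, 200, 25, 60, 90, 40, 55, 150, 35, 70],
})

g = sns.catplot(data=latrines, x="Taxa", y="OTUs", kind="box")
plt.xticks(rotation=90)   # pyplot acts on the current (catplot) Axes
g.set(yscale="log")       # rescale the y-axis on the figure object
```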


3.4.3 Be careful when adjusting your y-axis scale

Uh Oh! Something strange has happened to our data. Recall our data is filled with 0 counts! When we try to log-transform a 0 what is the result? Call on np.log(0) to convince yourself that you've created a problem with your visualization.

The problem is that our data has already been plotted as a boxplot; all of the summary statistics have already been calculated. When we then change the axis scale, those summary values are merely transformed and redrawn in the same way. Because log-transforming our 0 values produces -inf, trying to calculate these statistics correctly fails miserably.

Therefore we must filter our data for 0 values beforehand to avoid this issue. We'll sort our taxa as well to get them in alphabetical order for our boxplot. On top of this, however, you might surmise that our interquartile ranges, when warped along a log-transform scale might not make sense anymore. Instead, we should log-transform our data before plotting so that the summary statistics of the data can be properly calculated for that particular scale.

To accomplish this data transformation and creation of a new column, we'll use the assign() method, which allows us to create a new column within the DataFrame using the syntax `DataFrame.assign(new_col=expression_calculation)`.
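Chained together, the filter, sort, and transform might look like this (toy data; the logOTUs column name is an illustrative choice):

```python
import numpy as np
import pandas as pd

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli", "Bacteroidia", "Bacilli"],
    "OTUs": [0, 10, 100, 1000],
})

# Drop the 0 counts first, sort the taxa alphabetically,
# then add a log10-transformed column with assign()
clean = (latrines[latrines["OTUs"] > 0]
         .sort_values("Taxa")
         .assign(logOTUs=lambda d: np.log10(d["OTUs"])))
```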


3.4.4 Generate nested boxplots using the hue parameter

We already know that our data is measured across two countries so we can take advantage of this information to create a nested (paired/grouped) set of boxplots. When you have a smaller number of categories, this allows you to more directly compare the characteristics of your two populations. Let's see what happens when we use the hue parameter to distinguish between our country-level data.

We'll also set the boxplot() parameter width to put a little more distance between each category along the x-axis. This value usually ranges between 0 and 1.

To save on some space we'll additionally move the legend into the plot using the legend_out boolean parameter.
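Putting the three parameters together might look like this (toy data again):

```python
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli"] * 6,
    "OTUs": [120, 30, 200, 25, 90, 40, 150, 35, 80, 20, 110, 45],
    "Country": ["Tanzania"] * 6 + ["Vietnam"] * 6,
})

# hue nests one box per Country within each Taxa; width narrows the boxes;
# legend_out=False keeps the legend inside the plot area
g = sns.catplot(data=latrines, x="Taxa", y="OTUs", hue="Country",
                kind="box", width=0.5, legend_out=False)
```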


3.4.5 Facet your crowded data with the row and col parameters

So our plot above is quite busy. Whereas previously we used the relplot() method to generate a faceted scatterplot (or relational plot), here we'll use the catplot() method to accomplish something similar. The catplot() method handles the faceting of categorical plots using the same row and col parameters.

This time around we'll facet our data into a stacked set of plots using the row parameter. You'll notice that since we are no longer grouping by hue that the legend will also disappear.

It's time we fixed that y-axis label as well, using the set_ylabels() method. We should convey that the data scale is log10.
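As a sketch (toy data with a precomputed log column; the label string is a placeholder):

```python
import numpy as np
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli"] * 6,
    "OTUs": [120, 30, 200, 25, 90, 40, 150, 35, 80, 20, 110, 45],
    "Country": ["Tanzania"] * 6 + ["Vietnam"] * 6,
}).assign(logOTUs=lambda d: np.log10(d["OTUs"]))

# One row of Axes per Country instead of grouping by hue
g = sns.catplot(data=latrines, x="Taxa", y="logOTUs",
                row="Country", kind="box")
g.set_ylabels("OTU count (log10)")
```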


3.4.7 Visualize a boxplot with datapoints using swarmplot()

Even though boxplots give us summary statistics on our data, it is useful to readers (and reviewers) to be able to see where our individual data points are. We've already used rugplot() to help visualize our data distribution in density plots.

Similarly, for a boxplot we can add the data as another layer using swarmplot() to place dots on top of our boxplot. A swarmplot places data points that are overlapping next to each other, so we can get a better picture of the distribution of our data.

For simplicity we will subset the data to make a boxplot with 3 Taxa so that we can see the differences it makes when adding a complementary geom like this.


3.4.7.1 Separate your swarm points by hue when using grouped boxplots

Now that we have our base boxplot, we'll add the swarmplot as a layer. It's important to note that we are adding this layer after our initial call to catplot() but before we update our axis label. You can try on your own to see what happens if we add the swarmplot data after re-labeling the y-axis.

A couple of notes about the parameters we are using with the swarmplot: the hue parameter should match the grouping used in the boxplot, and dodge=True is needed so that the swarm points are separated by hue level rather than plotted on top of each group's box.

We'll also update the fliersize parameter in our catplot() to hide our outliers. Otherwise we'll be plotting those points twice when we add the swarmplot.
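Layering the two plots might look like this (toy data; point size and other cosmetic values are illustrative):

```python
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli"] * 6,
    "OTUs": [120, 30, 200, 25, 90, 40, 150, 35, 80, 20, 110, 45],
    "Country": ["Tanzania"] * 6 + ["Vietnam"] * 6,
})

# fliersize=0 hides the boxplot outliers so they aren't drawn twice
g = sns.catplot(data=latrines, x="Taxa", y="OTUs", hue="Country",
                kind="box", fliersize=0, legend_out=False)
# dodge=True splits the swarm points by hue to line up with the nested boxes
sns.swarmplot(data=latrines, x="Taxa", y="OTUs", hue="Country",
              dodge=True, size=4, ax=g.ax)
```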


3.5.0 Use the violin plot to visualize distributions

If you could combine aspects of the boxplot and the KDE into a single visualization, you would get the violin plot. Another way to think of the violin plot is as a KDE plot that's been shrunk down and placed categorically.

It's actually quite easy to switch over since many of the aspects are similar to the boxplot. We need only change the kind parameter in our catplot() code.


3.5.1 Combine a binary grouping with the split parameter

Sometimes a more direct comparison of your data can be applied through the violin plot by generating a split version of it. This is especially helpful when you are working with nested data that is binary and you would like to compare it visually.

We'll initialize this visualization with the split boolean parameter. To help with this visualization we'll also include some inner markers of the quartile information in each violin.
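The split violin might be sketched like this (toy data; note that split=True requires exactly two hue levels):

```python
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli"] * 6,
    "OTUs": [120, 30, 200, 25, 90, 40, 150, 35, 80, 20, 110, 45],
    "Country": ["Tanzania"] * 6 + ["Vietnam"] * 6,
})

# split=True requires exactly two hue levels; inner="quartile" draws
# the quartile lines inside each half-violin
g = sns.catplot(data=latrines, x="Taxa", y="OTUs", hue="Country",
                kind="violin", split=True, inner="quartile")
```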


4.0.0 Saving your plots to file

Up until now, we have taken for granted that our plots have been displayed using a Graphic Device. For our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display but sometimes you may be producing multiple visualizations based on large data sets. In this case it is preferable to save them directly to file.

4.1.0 Save your figures with plt.savefig() method

Once you have a figure the way you want it, you can save it in any number of graphical and non-graphical formats. The savefig() method from the pyplot package is here to save the day. When saving the current figure, you can use some of the following parameters: fname (the output path, whose extension also sets the file format), dpi (the resolution of the image), and bbox_inches (e.g. "tight" to trim surrounding whitespace).

Let's save our faceted scatterplot from way back in section 3.2.6.
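A minimal sketch of the save itself (writing a simple toy plot to a temporary directory here; in lecture you would save to your own path and filename):

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

meta = pd.DataFrame({
    "pH": [7.1, 8.2, 6.5, 7.8],
    "Temp": [25.0, 27.5, 24.1, 30.2],
})
sns.scatterplot(data=meta, x="pH", y="Temp")

# The format is inferred from the extension; dpi controls resolution
# and bbox_inches="tight" trims surrounding whitespace
outfile = os.path.join(tempfile.gettempdir(), "latrine_scatter.png")
plt.savefig(outfile, dpi=300, bbox_inches="tight")
```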


5.0.0 Plotting multi-panel figures

Up until now we've been generating a combination of either faceted plots or simply layering elements upon single plots. Throughout all of these we have not really been mixing the types of plots we've generated. Luckily for us, the matplotlib.pyplot module provides a means for us to put together multiple plot axes in a single figure.

We have already seen some of these figure-level functions in action with relplot() and catplot(), which provided an interface to axes-level methods like scatterplot() or boxplot() when creating faceted plots.

seaborn_function_overview_8_0.png
A handy figure from the seaborn overview at: https://seaborn.pydata.org/tutorial/function_overview.html

What if, however, we would like to create a figure with multiple axes of different types?

5.1.0 Use subplot2grid() to generate a multi-grid plot

When generally considering the layout of our data, we want to think of breaking up the figure into a grid. This can start off simply as a 1x1 panel and expand outwards with nested panels within a 2x2 or 3x3 or larger figure. The dimensions also are not limited to square shapes but can be rectangular as well.

The subplot2grid() function takes the following parameters: shape (the dimensions of the overall grid), loc (the row and column of the panel's upper-left corner), and the optional rowspan and colspan (how many rows or columns the panel should span).

ggarrange_examples.png
Some simple layouts demonstrating how a figure can be subdivided

Let's make a figure with 3 plots as seen in the 4th example above. We'll begin just by generating the specific layout.
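The three-panel layout (two panels on top, one full-width panel below) can be generated along these lines:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))

# A 2x2 grid: two panels across the top and one
# full-width panel (colspan=2) along the bottom
ax1 = plt.subplot2grid(shape=(2, 2), loc=(0, 0))
ax2 = plt.subplot2grid(shape=(2, 2), loc=(0, 1))
ax3 = plt.subplot2grid(shape=(2, 2), loc=(1, 0), colspan=2)
```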

5.1.1 Use the Axes objects as a canvas to plot onto fig

Now you can see that our figure encompasses the 3 panels that we envisioned. You'll notice that we named each panel with a reference/variable. We could have also added them to a single list object to save on variable names but to simplify our understanding they've been named separately.

Why do we need these objects? Each axes-level function we use to plot with can take a parameter ax, where we pass along an Axes object. This identifies which panel we want to plot onto in our overall figure. Otherwise it will use the last Axes object generated (i.e. ax3). We'll fill our Axes one panel at a time.

There is, however, a quick hitch. We can no longer rely directly on the figure-level functions of relplot, catplot and displot. In order to plot correctly on these subpanels, we'll need to use the axes-level function calls instead. We'll add our plots one at a time so you can see the effect of each.


5.1.2 Add multiple layers to the second Axes

Next we'll populate the second panel (top-right) with a combination of violinplot() and swarmplot(). We'll adjust the swarmplot slightly to make the size of the points smaller using the size parameter.


5.1.3 Remove the legend from your plot with Axes methods

So you can see there's a slight issue above with our legend in the violin plot. Due to size issues, it's dropped right into the bottom-middle of the plot. It doesn't really have much meaning for us so we will remove it with the get_legend().remove() command. This will access the Axes object legend and remove it for us.
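The removal pattern itself looks like this (a standalone toy violin plot here, rather than the multi-panel figure from lecture):

```python
import pandas as pd
import seaborn as sns

latrines = pd.DataFrame({
    "Taxa": ["Clostridia", "Bacilli"] * 4,
    "OTUs": [120, 30, 200, 25, 90, 40, 150, 35],
    "Country": ["Tanzania"] * 4 + ["Vietnam"] * 4,
})

ax = sns.violinplot(data=latrines, x="Taxa", y="OTUs", hue="Country")
ax.get_legend().remove()   # fetch the Axes legend and remove it
```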


5.1.4 Add a KDE plot to our final panel

Let's complete the set by adding a KDE plot to our final panel. We'll filter our dataset on the fly as we pass it to the kdeplot() function.


5.1.5 Adjust spacing between Axes using tight_layout()

Nearly there! We can see there are some overlapping text issues from the y-axis labels of the KDE plot spilling onto the scatter plot beside it. This is due to the size of the y-tick text itself pushing the title too far left. We can ask the plot to fix these spacing issues using the plt.tight_layout() method, which will resolve the overlaps as best it can.

Not too shabby! And we're done!


6.0.0 Class summary

That's our fourth class on Python! You've made it through and we've learned about taking advantage of built-in DataFrame methods for exploratory data analysis as well as how to finally visualize some of your data:

  1. Exploratory data analysis with groupby() and summarizing.
  2. Basic plots with the matplotlib.pyplot module.
  3. Advanced visualizations with the seaborn package.
  4. Saving your figures to file

6.1.0 Submit your completed skeleton notebook (2% of final grade)

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.7% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1700 hours the following day).

6.2.0 Post-lecture DataCamp assessment (8% of final grade)

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete the Introduction to Data Visualization with Seaborn course (4 chapters, 3700 possible points). This is a pass-fail assignment, and in order to pass you need to achieve at least 2775 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:

DataCamp.example.png
A sample screen shot for one of the DataCamp assignments. You'll want to combine yours into single images or PDFs if possible

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, February 10th to submit your assignment (right before the next lecture).


6.3.0 Acknowledgements

Revision 1.0.0: materials prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


The Center for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png